1 Testing Data Assumptions

\(RI_{i} \sim N(\mu_{i}, \sigma_{i})\)

The score is the normal distribution kernel:

\(\text{score}_{i} = \exp\left(-\frac{(RI_{i} - \mu_{i})^{2}}{2\sigma_{i}^{2}}\right)\)

Figure 1: Assumptions of the Fiehn retention index score.

  1. Experimental RI follows a normal distribution

  2. Experimental RI \(\bar{x}\) equals Theoretical RI \(\mu\)

  3. Experimental RI standard deviations \(s\) are equal across metabolites

  4. An implied data assumption is that \(\mu\) is independent of \(\sigma\). This is useful to know when selecting machine learning models.

  5. Experimental RI \(s\) depends on metabolite mass
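As a minimal sketch of the score these assumptions refer to, the function below assumes the kernel is the unnormalized normal density evaluated at the difference between experimental and theoretical RI, with the standard deviation fixed at 3. The name `ri_score` and its arguments are hypothetical and are not taken from the original analysis scripts.

```r
# Hypothetical helper: normal-kernel retention index score.
# Assumption: score = exp(-d^2 / (2 * sigma^2)), where d is the
# experimental RI minus the theoretical RI and sigma defaults to 3.
ri_score <- function(ri_diff, sigma = 3) {
  exp(-(ri_diff^2) / (2 * sigma^2))
}

ri_score(0)          # a perfect match scores 1
ri_score(c(-3, 3))   # one assumed SD away on either side: exp(-0.5), about 0.607
```

Under this form the score depends only on the squared difference, so it is symmetric about 0 and bounded between 0 and 1.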

1.1 RI: Normal Distribution

Number.of.Metabolites Fails.Shapiro.Test Passes.Shapiro.Test
88 86 2

Figure 2: An example of a metabolite that passes (right) and fails (left) the Shapiro-Wilk test for normality.

Goal: To determine how many of the 88 metabolites with n >= 30 follow a normal distribution, we ran a Shapiro-Wilk test for normality on each with an \(\alpha\) level of 0.05. The null hypothesis \(H_{0}\) is that the distribution is normal.

Notes: A large majority of metabolites (86 of 88) fail the test for normality; therefore, we reject the assumption that the retention index follows a normal distribution.
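The per-metabolite normality check could be sketched as follows. The data here are simulated stand-ins, and the column names `metabolite` and `ri_diff` are assumptions for illustration; only `shapiro.test` and the table layout mirror what the report shows.

```r
set.seed(1)

# Simulated stand-in data (hypothetical column names).
ri_data <- data.frame(
  metabolite = rep(c("A", "B"), each = 50),
  ri_diff    = c(rnorm(50, mean = 0, sd = 3),   # roughly normal
                 rexp(50, rate = 1 / 3))        # clearly skewed
)

# Keep only metabolites with at least 30 observations.
counts  <- table(ri_data$metabolite)
keep    <- names(counts)[counts >= 30]
ri_data <- ri_data[ri_data$metabolite %in% keep, ]

# One Shapiro-Wilk p-value per metabolite.
p_values <- tapply(ri_data$ri_diff, ri_data$metabolite,
                   function(x) shapiro.test(x)$p.value)

# A metabolite "fails" when normality is rejected at alpha = 0.05.
fails <- p_values < 0.05
summary_table <- data.frame(
  Number.of.Metabolites = length(p_values),
  Fails.Shapiro.Test    = sum(fails),
  Passes.Shapiro.Test   = sum(!fails)
)
summary_table
```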

1.2 RI: Experimental \(\bar{x}\) = 0

Number.of.Metabolites Fails.T.Test Passes.T.Test
88 86 2

Figure 3: The distribution of retention index difference means, which should be close to 0.

Goal: To determine how many of the 88 metabolites with n >= 30 have a true mean retention index difference \(\mu\) of 0, we ran a one-sample t-test on each with an \(\alpha\) level of 0.05. The null hypothesis \(H_{0}\) is that \(\mu\) is 0.

Notes: Nearly all metabolites (86 of 88) fail the Student’s t-test; therefore, we reject the assumption that the true retention index difference mean is 0.
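A one-sample t-test per metabolite could look like the sketch below. The two simulated metabolites and the list name `ri_diff_by_metabolite` are hypothetical; the call to `t.test` with `mu = 0` is the part that corresponds to the test described above.

```r
set.seed(2)

# Simulated RI differences for two hypothetical metabolites.
ri_diff_by_metabolite <- list(
  centered = rnorm(40, mean = 0, sd = 3),
  shifted  = rnorm(40, mean = 5, sd = 3)
)

# Test H0: mu = 0 for each metabolite.
p_values <- sapply(ri_diff_by_metabolite,
                   function(x) t.test(x, mu = 0)$p.value)
sample_means <- sapply(ri_diff_by_metabolite, mean)

data.frame(
  mean    = sample_means,
  p_value = p_values,
  fails   = p_values < 0.05   # TRUE when H0 is rejected at alpha = 0.05
)
```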

1.3 RI: \(s\) are equal

Figure 4: The distribution of retention index standard deviations, which should be close to 3.

Goal: To test whether the variance of the retention index difference per metabolite is consistent. We would expect a histogram to show similar standard deviation values.

Notes: The standard deviation ranges from 0.0106415 to 16.0892242 and the variance from \(1.132407 \times 10^{-4}\) to 258.8631368. An F-test for equal variances compares two metabolites at a time. For the F-test between the metabolites with the highest and lowest \(\sigma^2\), the p-value is less than 0.01. We can therefore reject the null hypothesis that the variances are equal across all metabolite retention index differences.
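The comparison of the two extreme metabolites can be illustrated with `var.test`. The samples below are simulated with standard deviations close to the reported extremes; the variable names and sample sizes are assumptions.

```r
set.seed(3)

# Simulated stand-ins for the lowest- and highest-variance metabolites.
low_var  <- rnorm(40, mean = 0, sd = 0.01)
high_var <- rnorm(40, mean = 0, sd = 16)

# F-test of H0: the two population variances are equal.
f_result <- var.test(high_var, low_var)

f_result$statistic   # ratio of the two sample variances
f_result$p.value     # compare against the 0.01 threshold
```

The F-test itself assumes normally distributed samples, which Section 1.1 calls into question, so a rank-based check such as `fligner.test` may be worth running alongside it.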

1.4 RI: \(\mu\) independent of \(\sigma\)

## 
##  Spearman's rank correlation rho
## 
## data:  abs(Anno30_StatsTest$Mean) and Anno30_StatsTest$SD
## S = 95630, p-value = 0.1415
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.1579198

Figure 5: The relationship between the standard deviation and the mean of each metabolite’s retention index distribution.

Goal: The fourth data assumption is that the standard deviation \(\sigma\) is independent of the mean \(\mu\) of the retention index difference per metabolite. An easy way to determine whether there is a relationship between \(\mu\) and \(\sigma\) is a nonparametric correlation test; here we used Spearman's rank correlation between the absolute mean and the standard deviation.

Notes: With \(\rho = 0.158\) and p = 0.1415, the test is not significant at an \(\alpha\) level of 0.05, so we cannot reject the null hypothesis of no monotonic relationship between \(|\mu|\) and \(\sigma\). The weak positive association is not statistically supported.
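The call behind output of this kind might be sketched as below. The data frame `stats_table` is simulated and its name is hypothetical; the columns `Mean` and `SD` follow the names visible in the output above.

```r
set.seed(4)

# Simulated per-metabolite summary statistics (88 rows, as in the report).
stats_table <- data.frame(
  Mean = rnorm(88, mean = 0, sd = 10),
  SD   = runif(88, min = 0.01, max = 16)
)

# Spearman's rank correlation between |mean| and SD.
result <- cor.test(abs(stats_table$Mean), stats_table$SD,
                   method = "spearman")

result$estimate                    # rho
result$p.value
reject <- result$p.value < 0.05    # TRUE only if H0 (rho = 0) is rejected
reject
```

Reading the decision off `reject` rather than off the sign of rho keeps the conclusion tied to the chosen \(\alpha\) level.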

1.5 Conclusion

Assumption (RID = retention index difference) | Pass/Fail
RID is normal                                 | Fail
RID mean is 0                                 | Fail
RID SDs are equal                             | Fail
RID mean is independent of SD                 | Not rejected (Spearman p = 0.1415)

2 Effect of Violating Assumptions

2.1 No Violations

Figure 6: The distribution of RI scores when no assumptions of the score are violated. 1000 draws were taken from a normal distribution with a mean of 0, an sd of 3, and a skew of 0 (all normal distributions are assumed to have no skew).

Fig. 6 shows the retention index scores calculated for the expected distribution, where no assumptions are violated. Even in this ideal case the median score, 0.794157, is not particularly high, and there are no outliers within the 0 to 1 score range. Next, we test how the score changes when the mean assumption is violated.
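The baseline simulation can be approximated as follows, assuming the score is the unnormalized normal kernel with a standard deviation of 3 (the hypothetical `ri_score` helper). Under that assumption the median also has a closed form, because the median of \(|Z|\) for a standard normal \(Z\) is `qnorm(0.75)`.

```r
set.seed(5)

# Hypothetical helper: normal-kernel score with an assumed SD of 3.
ri_score <- function(ri_diff, sigma = 3) {
  exp(-(ri_diff^2) / (2 * sigma^2))
}

# 1000 draws from the distribution the score assumes.
draws  <- rnorm(1000, mean = 0, sd = 3)
scores <- ri_score(draws)

median(scores)   # simulated median score
range(scores)    # all scores lie between 0 and 1

# Closed-form median under the same assumption, about 0.797.
theoretical_median <- exp(-qnorm(0.75)^2 / 2)
theoretical_median
```

The closed-form value is close to the reported median of 0.794157, which is consistent with the assumed form of the kernel.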

2.2 Mean Violation

Figure 7: Distribution of retention index scores if the mean assumption of 0 is violated and the rest of the assumptions are fixed for means -35 to 35 in units of 5 (left) or from means -10 to 10 in units of 1 (right).

The median RI score drops below 0.5 as soon as the true mean is +/-4 away from the expected mean (Fig. 7). Since our true means range from -35 to 35, the current RI score is not sufficient for our needs, especially beyond +/-10, where no scores exceed 0.4. The effect on the RI score is symmetric about 0. With this in mind, a good range of means to simulate is 0 to 6, since the median RI score approaches 0 around 6.
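A sweep over true means, holding the standard deviation at 3, might be sketched like this. The `ri_score` helper and the choice of 1000 draws per mean are assumptions for illustration.

```r
set.seed(6)

# Hypothetical helper: normal-kernel score with an assumed SD of 3.
ri_score <- function(ri_diff, sigma = 3) {
  exp(-(ri_diff^2) / (2 * sigma^2))
}

# True means from -10 to 10 in steps of 1; SD stays at the assumed 3.
true_means <- -10:10
median_scores <- sapply(true_means, function(m) {
  median(ri_score(rnorm(1000, mean = m, sd = 3)))
})
names(median_scores) <- true_means

round(median_scores, 2)
```

Because the kernel depends only on the squared difference, the medians for a mean of -m and +m should agree up to simulation noise.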

2.3 SD Violation

Figure 8: Distribution of retention index scores if the standard deviation assumption of 3 is violated and the rest of the assumptions are fixed for the range of 0.01 to 2.81 (top) and 3.5 to 15 (bottom).

If the true standard deviation of the distribution is less than 3 and approaches 0, the median retention index score shifts toward 1 (Fig. 8). If the true standard deviation is greater than 3 and approaches \(\infty\), the median retention index score shifts toward 0. A good range for simulation is from 0.01 to 8 or, for a more focused range, from 3 to 8.
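The same idea applies to the standard deviation: the sketch below keeps the mean at 0 and varies the true standard deviation while the score keeps assuming 3. The helper `ri_score` and the specific grid of values are hypothetical.

```r
set.seed(7)

# Hypothetical helper: the score always assumes an SD of 3.
ri_score <- function(ri_diff, sigma = 3) {
  exp(-(ri_diff^2) / (2 * sigma^2))
}

# True SDs below and above the assumed value of 3.
true_sds <- c(0.01, 1, 3, 8, 15)
median_scores <- sapply(true_sds, function(s) {
  median(ri_score(rnorm(1000, mean = 0, sd = s)))
})
names(median_scores) <- true_sds

round(median_scores, 3)   # decreases as the true SD grows
```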

2.4 Skew Violation

Figure 9: Distribution of retention index scores if the skew assumption of 0 is violated and the rest of the assumptions are fixed.

Since the normal distribution has no skew parameter, we used the gamma (Erlang) distribution and derived the shape and rate parameters corresponding to each target skew. Fig. 9 shows the results. As the skew approaches 0 from either side, the interquartile range (IQR) grows and the median is pulled away from 1. Interestingly, violating only the skew assumption results in better retention index scores. Since the effect is symmetric about 0, we chose a range of skews from 0 to 4.
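One way to derive gamma parameters from a target skew uses the standard moment relations for a gamma distribution with shape \(k\) and rate \(r\): mean \(k/r\), standard deviation \(\sqrt{k}/r\), and skew \(2/\sqrt{k}\). The sketch below is an assumption about how this could be done, not the original derivation. The function `r_skewed` is hypothetical, it shifts the draws to hit the target mean, it mirrors them for negative skew, and it does not round the shape to an integer as a strict Erlang distribution would require.

```r
# Hypothetical sampler: shifted (and possibly mirrored) gamma draws
# with a requested mean, SD, and nonzero skew.
r_skewed <- function(n, mean = 0, sd = 3, skew = 1) {
  stopifnot(skew != 0, sd > 0)
  shape <- 4 / skew^2          # from skew = 2 / sqrt(shape)
  rate  <- sqrt(shape) / sd    # from sd = sqrt(shape) / rate
  # Center the gamma draws at 0 by subtracting the gamma mean.
  x <- rgamma(n, shape = shape, rate = rate) - shape / rate
  # Mirror for negative skew, then shift to the target mean.
  sign(skew) * x + mean
}

set.seed(8)
x <- r_skewed(10000, mean = 0, sd = 3, skew = 2)

# Compare the sample statistics with their targets.
sample_skew <- mean((x - mean(x))^3) / sd(x)^3
c(mean = mean(x), sd = sd(x), skew = sample_skew)
```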

2.5 Multiple Assumption Violations

Figure 10: Summary statistics of the simulated distribution versus the desired statistic for mean (left), standard deviation (center), and skew (right).

To simulate the effect of violating multiple assumptions at once, values were drawn from a gamma (Erlang) distribution. All simulated summary statistics are within 0.5 units of their target values (Fig. 10). See the ri_score_multiple_violations script for specifics on how this was accomplished.

Figure 11: The effect that adjusting mean, standard deviation, and skew combinatorially has on the median retention index score (n = 10,000 for each simulated distribution, indicated by a point).

In Fig. 11, we see the effects of violating the mean, standard deviation, and skew assumptions combinatorially. Here, we see that more extreme standard deviations and means have less of an impact on overall RI scores at more extreme skews.
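A combinatorial grid of this kind could be built with `expand.grid`, as in the sketch below. It is not the ri_score_multiple_violations script: the helpers `ri_score` and `r_skewed`, the grid values, and the moment-matching approach are all assumptions for illustration.

```r
set.seed(9)

# Hypothetical helper: normal-kernel score with an assumed SD of 3.
ri_score <- function(ri_diff, sigma = 3) {
  exp(-(ri_diff^2) / (2 * sigma^2))
}

# Hypothetical sampler: shifted gamma draws matching mean, SD, and skew.
r_skewed <- function(n, mean, sd, skew) {
  shape <- 4 / skew^2
  rate  <- sqrt(shape) / sd
  sign(skew) * (rgamma(n, shape = shape, rate = rate) - shape / rate) + mean
}

# Every combination of mean, SD, and skew to simulate.
grid <- expand.grid(mean = c(0, 3, 6),
                    sd   = c(3, 5, 8),
                    skew = c(0.5, 2, 4))

# Median score of 10,000 draws for each combination.
grid$median_score <- mapply(
  function(m, s, k) median(ri_score(r_skewed(10000, m, s, k))),
  grid$mean, grid$sd, grid$skew
)

head(grid)
```

Each row of `grid` then corresponds to one point of the kind plotted in a figure like Fig. 11.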

The next step is to investigate how adjusting the RI score to each distribution’s true mean and standard deviation affects the score ranks, since the original score is based on a normal kernel and has no adjustment for skew. Later, we will investigate other distributions that both do and do not account for skew.

3 Publication Figure

4 Poster Figures